Assignment 1

Author

Arianna Hernandez

Assignment Description

We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM2.5 (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).

A primer on particulate matter air pollution can be found here.

Your assignment should be completed in Quarto or R Markdown.

Steps

1. Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.

library(data.table)
#2002 data
data_2002 <- fread('2002data.csv')
data_2002 <- as.data.frame(data_2002)
dim(data_2002)
[1] 15976    22
head(data_2002)
        Date Source  Site ID POC Daily Mean PM2.5 Concentration    Units
1 01/05/2002    AQS 60010007   1                           25.1 ug/m3 LC
2 01/06/2002    AQS 60010007   1                           31.6 ug/m3 LC
3 01/08/2002    AQS 60010007   1                           21.4 ug/m3 LC
4 01/11/2002    AQS 60010007   1                           25.9 ug/m3 LC
5 01/14/2002    AQS 60010007   1                           34.5 ug/m3 LC
6 01/17/2002    AQS 60010007   1                           41.0 ug/m3 LC
  Daily AQI Value Local Site Name Daily Obs Count Percent Complete
1              81       Livermore               1              100
2              93       Livermore               1              100
3              74       Livermore               1              100
4              82       Livermore               1              100
5              98       Livermore               1              100
6             115       Livermore               1              100
  AQS Parameter Code AQS Parameter Description Method Code
1              88101  PM2.5 - Local Conditions         120
2              88101  PM2.5 - Local Conditions         120
3              88101  PM2.5 - Local Conditions         120
4              88101  PM2.5 - Local Conditions         120
5              88101  PM2.5 - Local Conditions         120
6              88101  PM2.5 - Local Conditions         120
                     Method Description CBSA Code
1 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
2 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
3 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
4 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
5 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
6 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
                          CBSA Name State FIPS Code      State County FIPS Code
1 San Francisco-Oakland-Hayward, CA               6 California                1
2 San Francisco-Oakland-Hayward, CA               6 California                1
3 San Francisco-Oakland-Hayward, CA               6 California                1
4 San Francisco-Oakland-Hayward, CA               6 California                1
5 San Francisco-Oakland-Hayward, CA               6 California                1
6 San Francisco-Oakland-Hayward, CA               6 California                1
   County Site Latitude Site Longitude
1 Alameda      37.68753      -121.7842
2 Alameda      37.68753      -121.7842
3 Alameda      37.68753      -121.7842
4 Alameda      37.68753      -121.7842
5 Alameda      37.68753      -121.7842
6 Alameda      37.68753      -121.7842
tail(data_2002)
            Date Source  Site ID POC Daily Mean PM2.5 Concentration    Units
15971 12/10/2002    AQS 61131003   1                             15 ug/m3 LC
15972 12/13/2002    AQS 61131003   1                             15 ug/m3 LC
15973 12/22/2002    AQS 61131003   1                              1 ug/m3 LC
15974 12/25/2002    AQS 61131003   1                             23 ug/m3 LC
15975 12/28/2002    AQS 61131003   1                              5 ug/m3 LC
15976 12/31/2002    AQS 61131003   1                              6 ug/m3 LC
      Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
15971              62 Woodland-Gibson Road               1              100
15972              62 Woodland-Gibson Road               1              100
15973               6 Woodland-Gibson Road               1              100
15974              77 Woodland-Gibson Road               1              100
15975              28 Woodland-Gibson Road               1              100
15976              33 Woodland-Gibson Road               1              100
      AQS Parameter Code AQS Parameter Description Method Code
15971              88101  PM2.5 - Local Conditions         117
15972              88101  PM2.5 - Local Conditions         117
15973              88101  PM2.5 - Local Conditions         117
15974              88101  PM2.5 - Local Conditions         117
15975              88101  PM2.5 - Local Conditions         117
15976              88101  PM2.5 - Local Conditions         117
                         Method Description CBSA Code
15971 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15972 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15973 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15974 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15975 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15976 R & P Model 2000 PM2.5 Sampler w/WINS     40900
                                    CBSA Name State FIPS Code      State
15971 Sacramento--Roseville--Arden-Arcade, CA               6 California
15972 Sacramento--Roseville--Arden-Arcade, CA               6 California
15973 Sacramento--Roseville--Arden-Arcade, CA               6 California
15974 Sacramento--Roseville--Arden-Arcade, CA               6 California
15975 Sacramento--Roseville--Arden-Arcade, CA               6 California
15976 Sacramento--Roseville--Arden-Arcade, CA               6 California
      County FIPS Code County Site Latitude Site Longitude
15971              113   Yolo      38.66121      -121.7327
15972              113   Yolo      38.66121      -121.7327
15973              113   Yolo      38.66121      -121.7327
15974              113   Yolo      38.66121      -121.7327
15975              113   Yolo      38.66121      -121.7327
15976              113   Yolo      38.66121      -121.7327
names(data_2002)
 [1] "Date"                           "Source"                        
 [3] "Site ID"                        "POC"                           
 [5] "Daily Mean PM2.5 Concentration" "Units"                         
 [7] "Daily AQI Value"                "Local Site Name"               
 [9] "Daily Obs Count"                "Percent Complete"              
[11] "AQS Parameter Code"             "AQS Parameter Description"     
[13] "Method Code"                    "Method Description"            
[15] "CBSA Code"                      "CBSA Name"                     
[17] "State FIPS Code"                "State"                         
[19] "County FIPS Code"               "County"                        
[21] "Site Latitude"                  "Site Longitude"                
str(data_2002)
'data.frame':   15976 obs. of  22 variables:
 $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Daily Mean PM2.5 Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily AQI Value               : int  81 93 74 82 98 115 89 62 69 107 ...
 $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method Code                   : int  120 120 120 120 120 120 120 120 120 120 ...
 $ Method Description            : chr  "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
 $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
sum(is.na(data_2002$"Daily Mean PM2.5 Concentration"))
[1] 0
summary(data_2002[,5])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    7.00   12.00   16.12   20.50  104.30 
# 2022 data
data_2022 <- fread('2022data.csv')
data_2022 <- as.data.frame(data_2022)
dim(data_2022)
[1] 59756    22
head(data_2022)
        Date Source  Site ID POC Daily Mean PM2.5 Concentration    Units
1 01/01/2022    AQS 60010007   3                           12.7 ug/m3 LC
2 01/02/2022    AQS 60010007   3                           13.9 ug/m3 LC
3 01/03/2022    AQS 60010007   3                            7.1 ug/m3 LC
4 01/04/2022    AQS 60010007   3                            3.7 ug/m3 LC
5 01/05/2022    AQS 60010007   3                            4.2 ug/m3 LC
6 01/06/2022    AQS 60010007   3                            3.8 ug/m3 LC
  Daily AQI Value Local Site Name Daily Obs Count Percent Complete
1              58       Livermore               1              100
2              60       Livermore               1              100
3              39       Livermore               1              100
4              21       Livermore               1              100
5              23       Livermore               1              100
6              21       Livermore               1              100
  AQS Parameter Code AQS Parameter Description Method Code
1              88101  PM2.5 - Local Conditions         170
2              88101  PM2.5 - Local Conditions         170
3              88101  PM2.5 - Local Conditions         170
4              88101  PM2.5 - Local Conditions         170
5              88101  PM2.5 - Local Conditions         170
6              88101  PM2.5 - Local Conditions         170
                    Method Description CBSA Code
1 Met One BAM-1020 Mass Monitor w/VSCC     41860
2 Met One BAM-1020 Mass Monitor w/VSCC     41860
3 Met One BAM-1020 Mass Monitor w/VSCC     41860
4 Met One BAM-1020 Mass Monitor w/VSCC     41860
5 Met One BAM-1020 Mass Monitor w/VSCC     41860
6 Met One BAM-1020 Mass Monitor w/VSCC     41860
                          CBSA Name State FIPS Code      State County FIPS Code
1 San Francisco-Oakland-Hayward, CA               6 California                1
2 San Francisco-Oakland-Hayward, CA               6 California                1
3 San Francisco-Oakland-Hayward, CA               6 California                1
4 San Francisco-Oakland-Hayward, CA               6 California                1
5 San Francisco-Oakland-Hayward, CA               6 California                1
6 San Francisco-Oakland-Hayward, CA               6 California                1
   County Site Latitude Site Longitude
1 Alameda      37.68753      -121.7842
2 Alameda      37.68753      -121.7842
3 Alameda      37.68753      -121.7842
4 Alameda      37.68753      -121.7842
5 Alameda      37.68753      -121.7842
6 Alameda      37.68753      -121.7842
tail(data_2022)
            Date Source  Site ID POC Daily Mean PM2.5 Concentration    Units
59751 12/01/2022    AQS 61131003   1                            3.4 ug/m3 LC
59752 12/07/2022    AQS 61131003   1                            3.8 ug/m3 LC
59753 12/13/2022    AQS 61131003   1                            6.0 ug/m3 LC
59754 12/19/2022    AQS 61131003   1                           34.8 ug/m3 LC
59755 12/25/2022    AQS 61131003   1                           23.2 ug/m3 LC
59756 12/31/2022    AQS 61131003   1                            1.0 ug/m3 LC
      Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
59751              19 Woodland-Gibson Road               1              100
59752              21 Woodland-Gibson Road               1              100
59753              33 Woodland-Gibson Road               1              100
59754              99 Woodland-Gibson Road               1              100
59755              77 Woodland-Gibson Road               1              100
59756               6 Woodland-Gibson Road               1              100
      AQS Parameter Code AQS Parameter Description Method Code
59751              88101  PM2.5 - Local Conditions         145
59752              88101  PM2.5 - Local Conditions         145
59753              88101  PM2.5 - Local Conditions         145
59754              88101  PM2.5 - Local Conditions         145
59755              88101  PM2.5 - Local Conditions         145
59756              88101  PM2.5 - Local Conditions         145
                                         Method Description CBSA Code
59751 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59752 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59753 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59754 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59755 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59756 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                    CBSA Name State FIPS Code      State
59751 Sacramento--Roseville--Arden-Arcade, CA               6 California
59752 Sacramento--Roseville--Arden-Arcade, CA               6 California
59753 Sacramento--Roseville--Arden-Arcade, CA               6 California
59754 Sacramento--Roseville--Arden-Arcade, CA               6 California
59755 Sacramento--Roseville--Arden-Arcade, CA               6 California
59756 Sacramento--Roseville--Arden-Arcade, CA               6 California
      County FIPS Code County Site Latitude Site Longitude
59751              113   Yolo      38.66121      -121.7327
59752              113   Yolo      38.66121      -121.7327
59753              113   Yolo      38.66121      -121.7327
59754              113   Yolo      38.66121      -121.7327
59755              113   Yolo      38.66121      -121.7327
59756              113   Yolo      38.66121      -121.7327
names(data_2022)
 [1] "Date"                           "Source"                        
 [3] "Site ID"                        "POC"                           
 [5] "Daily Mean PM2.5 Concentration" "Units"                         
 [7] "Daily AQI Value"                "Local Site Name"               
 [9] "Daily Obs Count"                "Percent Complete"              
[11] "AQS Parameter Code"             "AQS Parameter Description"     
[13] "Method Code"                    "Method Description"            
[15] "CBSA Code"                      "CBSA Name"                     
[17] "State FIPS Code"                "State"                         
[19] "County FIPS Code"               "County"                        
[21] "Site Latitude"                  "Site Longitude"                
str(data_2022)
'data.frame':   59756 obs. of  22 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily Mean PM2.5 Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily AQI Value               : int  58 60 39 21 23 21 13 38 59 55 ...
 $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method Code                   : int  170 170 170 170 170 170 170 170 170 170 ...
 $ Method Description            : chr  "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
 $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
sum(is.na(data_2022$"Daily Mean PM2.5 Concentration"))
[1] 0
summary(data_2022[,5])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -6.700   4.100   6.800   8.428  10.700 302.500 
sum(data_2022[,5] < 0)
[1] 215

The 2002 data has 15,976 rows of observations, while the 2022 data has 59,756. Both data sets have 22 columns, with matching variable names and data types. Neither data set has missing values for the key variable: summary() would report an NA count if there were any, and it shows none.

Values for the 2002 daily PM2.5 concentrations were not concerning, with a minimum of 0 and a maximum of 104.3. In 2022, however, 215 measurements are negative, with a minimum of -6.7. A concentration cannot be negative, so these are likely measurement errors.
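The checks described above can be bundled into one small helper so that both years are audited the same way. The following is a minimal sketch on made-up values; `check_pm` and `toy_pm` are hypothetical names that do not appear in the original analysis.

```r
# Hypothetical helper: summarise missing and negative readings in a PM2.5 vector.
check_pm <- function(x) {
  c(
    n             = length(x),
    n_missing     = sum(is.na(x)),
    n_negative    = sum(x < 0, na.rm = TRUE),
    # share of non-missing readings that are below zero
    prop_negative = mean(x < 0, na.rm = TRUE)
  )
}

# Toy values for illustration only (not EPA data).
toy_pm <- c(12.7, 3.4, -0.1, NA, 25.1, -6.7, 8.0, 4.2)
result <- check_pm(toy_pm)
print(result)
```

On the real data, something like `check_pm(data_2022[, 5])` should reproduce the negative count reported by the code above, assuming column 5 is still the PM2.5 concentration.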

2. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.

library(tidyverse)
library(dplyr)
data_2002 <- data_2002 %>%
  mutate(year = factor(2002))
names(data_2002)[5] <- "PM"
names(data_2002)[21] <- "lat"
names(data_2002)[22] <- "lon"

data_2022 <- data_2022 %>%
  mutate(year = factor(2022))
names(data_2022)[5] <- "PM"
names(data_2022)[21] <- "lat"
names(data_2022)[22] <- "lon"

mergedData <- data_2002 %>%
  bind_rows(data_2022)
mergedData <- mergedData %>%
  mutate(month = format(as.Date(Date, format = "%m/%d/%Y"), "%m"))
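The step asks for the year to be derived from the Date variable, whereas the code above assigns it as a constant per file. As a hedged alternative, the sketch below parses the date once and extracts both year and month from it. The `toy` data frame is invented for illustration and only mimics the month/day/year format of the EPA files.

```r
# Toy rows mimicking the EPA "Date" column (month/day/year strings).
toy <- data.frame(
  Date = c("01/05/2002", "12/31/2002", "01/01/2022", "09/20/2022"),
  PM   = c(25.1, 6.0, 12.7, -6.7)
)

# Parse the strings once, then pull out the pieces needed later.
parsed    <- as.Date(toy$Date, format = "%m/%d/%Y")
toy$year  <- factor(format(parsed, "%Y"))  # "2002" / "2022" as a factor
toy$month <- format(parsed, "%m")          # two-digit month as character

print(toy)
```

Deriving the year this way keeps the identifier correct even if the two files are combined before the column is created.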

3. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.

library(leaflet)
loc.pal <- colorFactor(c('darkgreen','goldenrod'), domain=mergedData$year)

#map includes both years
sitemap <- leaflet(mergedData) |> 
     addProviderTiles('CartoDB.Positron') |> 
  addCircles(
    lat = ~lat, lng=~lon,
    label = ~paste0(year), color = ~ loc.pal(year),
    opacity = 1, fillOpacity = .5, radius = 500
    ) |>
  setView(lng = mean(mergedData$lon, na.rm = TRUE), 
          lat = mean(mergedData$lat, na.rm = TRUE), 
          zoom = 5) |>
  addLegend('bottomleft', pal=loc.pal, values=mergedData$year,
            title='Year', opacity=1)
sitemap
#map for 2002
map2002 <- leaflet(data_2002) |> 
    addProviderTiles('CartoDB.Positron') |> 
    # Some circles
    addCircles(
      lat = ~lat, lng=~lon,
                                                    # HERE IS OUR PAL!
      label = ~paste0(year), color = ~ loc.pal(year),
      opacity = 1, fillOpacity = .5, radius = 500
      ) |>
    setView(lng = mean(data_2002$lon, na.rm = TRUE), 
          lat = mean(data_2002$lat, na.rm = TRUE), 
          zoom = 5) |>
    # And a pretty legend
    addLegend('bottomleft', pal=loc.pal, values=data_2002$year,
            title='Year', opacity=1)
map2002
#map for 2022
map2022 <- leaflet(data_2022) |> 
    addProviderTiles('CartoDB.Positron') |> 
    # Some circles
    addCircles(
      lat = ~lat, lng=~lon,
                                                    # HERE IS OUR PAL!
      label = ~paste0(year), color = ~ loc.pal(year),
      opacity = 1, fillOpacity = .5, radius = 500
      ) |>
    setView(lng = mean(data_2022$lon, na.rm = TRUE), 
          lat = mean(data_2022$lat, na.rm = TRUE), 
          zoom = 5) |>
    # And a pretty legend
    addLegend('bottomleft', pal=loc.pal, values=data_2022$year,
            title='Year', opacity=1)
map2022

Though most of the 2002 locations seem to have remained the same, there are many more monitoring sites in the 2022 data. Most of the clustering occurs along the coast, with the majority of sites near Los Angeles and San Francisco. The fewest sites are found toward the middle and east of the state.
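The visual impression from the maps can be backed by a count of distinct sites per year. This is a minimal sketch on invented rows; `site_id` is a hypothetical rename of the `Site ID` column, and the toy values are not the real site inventory.

```r
# Toy daily records: several rows per site, two years.
toy <- data.frame(
  site_id = c(60010007, 60010007, 61131003, 60010007, 60571001, 61131003, 61131003),
  year    = c(2002, 2002, 2002, 2022, 2022, 2022, 2022)
)

# Reduce daily rows to one row per site and year.
sites <- unique(toy[, c("site_id", "year")])

# Number of distinct sites in each year.
sites_per_year <- table(sites$year)

# Sites that report in both years.
shared <- intersect(sites$site_id[sites$year == 2002],
                    sites$site_id[sites$year == 2022])

print(sites_per_year)
print(shared)
```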

4. Check for any missing or implausible values of PM2.5 in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.

hist(mergedData$PM)

boxplot(mergedData$PM)

summary(mergedData$PM)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -6.70    4.50    7.60   10.05   12.20  302.50 
lowvalues <- mergedData %>%
  group_by(year) %>%
  filter(PM < 0) %>%
  arrange(PM)
head(lowvalues)
# A tibble: 6 × 24
# Groups:   year [1]
  Date    Source `Site ID`   POC    PM Units `Daily AQI Value` `Local Site Name`
  <chr>   <chr>      <int> <int> <dbl> <chr>             <int> <chr>            
1 09/20/… AQS     60571001     5  -6.7 ug/m…                 0 Truckee-Fire Sta…
2 09/19/… AQS     60571001     5  -6.3 ug/m…                 0 Truckee-Fire Sta…
3 09/21/… AQS     60571001     5  -5.1 ug/m…                 0 Truckee-Fire Sta…
4 09/03/… AQS     60571001     5  -4.7 ug/m…                 0 Truckee-Fire Sta…
5 09/22/… AQS     60571001     5  -4.7 ug/m…                 0 Truckee-Fire Sta…
6 09/04/… AQS     60571001     5  -4.1 ug/m…                 0 Truckee-Fire Sta…
# ℹ 16 more variables: `Daily Obs Count` <int>, `Percent Complete` <dbl>,
#   `AQS Parameter Code` <int>, `AQS Parameter Description` <chr>,
#   `Method Code` <int>, `Method Description` <chr>, `CBSA Code` <int>,
#   `CBSA Name` <chr>, `State FIPS Code` <int>, State <chr>,
#   `County FIPS Code` <int>, County <chr>, lat <dbl>, lon <dbl>, year <fct>,
#   month <chr>
tail(lowvalues)
# A tibble: 6 × 24
# Groups:   year [1]
  Date    Source `Site ID`   POC    PM Units `Daily AQI Value` `Local Site Name`
  <chr>   <chr>      <int> <int> <dbl> <chr>             <int> <chr>            
1 12/11/… AQS     60870007     3  -0.1 ug/m…                 0 Santa Cruz       
2 07/27/… AQS     60970004     3  -0.1 ug/m…                 0 Sebastopol       
3 12/11/… AQS     61070009     1  -0.1 ug/m…                 0 Sequoia & Kings …
4 04/23/… AQS     61130004     3  -0.1 ug/m…                 0 Davis-UCD Campus 
5 11/02/… AQS     61130004     3  -0.1 ug/m…                 0 Davis-UCD Campus 
6 11/03/… AQS     61130004     3  -0.1 ug/m…                 0 Davis-UCD Campus 
# ℹ 16 more variables: `Daily Obs Count` <int>, `Percent Complete` <dbl>,
#   `AQS Parameter Code` <int>, `AQS Parameter Description` <chr>,
#   `Method Code` <int>, `Method Description` <chr>, `CBSA Code` <int>,
#   `CBSA Name` <chr>, `State FIPS Code` <int>, State <chr>,
#   `County FIPS Code` <int>, County <chr>, lat <dbl>, lon <dbl>, year <fct>,
#   month <chr>
lowvalues %>%
   ggplot(mapping = aes(x = month, y = PM, fill = factor(year))) +
  scale_y_continuous(trans = "reverse") +
  geom_bar(position = "dodge", stat = "identity")

highvalues<- mergedData %>%
   group_by(year) %>%
   filter(PM > 75) %>%
  arrange(PM)
head(highvalues)
# A tibble: 6 × 24
# Groups:   year [2]
  Date    Source `Site ID`   POC    PM Units `Daily AQI Value` `Local Site Name`
  <chr>   <chr>      <int> <int> <dbl> <chr>             <int> <chr>            
1 11/24/… AQS     60290014     3  75.1 ug/m…               165 Bakersfield-Cali…
2 04/02/… AQS     60658001     6  75.1 ug/m…               165 Rubidoux         
3 02/02/… AQS     60290014     1  75.3 ug/m…               165 Bakersfield-Cali…
4 08/27/… AQS     60431001     3  75.3 ug/m…               165 Yosemite NP-Yose…
5 09/12/… AQS     60570005     3  75.3 ug/m…               165 Grass Valley-Lit…
6 04/02/… AQS     60651003     1  75.5 ug/m…               165 Riverside (Magno…
# ℹ 16 more variables: `Daily Obs Count` <int>, `Percent Complete` <dbl>,
#   `AQS Parameter Code` <int>, `AQS Parameter Description` <chr>,
#   `Method Code` <int>, `Method Description` <chr>, `CBSA Code` <int>,
#   `CBSA Name` <chr>, `State FIPS Code` <int>, State <chr>,
#   `County FIPS Code` <int>, County <chr>, lat <dbl>, lon <dbl>, year <fct>,
#   month <chr>
tail(highvalues)
# A tibble: 6 × 24
# Groups:   year [1]
  Date    Source `Site ID`   POC    PM Units `Daily AQI Value` `Local Site Name`
  <chr>   <chr>      <int> <int> <dbl> <chr>             <int> <chr>            
1 09/10/… AQS     60570005     3  218. ug/m…               293 Grass Valley-Lit…
2 09/10/… AQS     60610004     3  244. ug/m…               338 Colfax-City Hall 
3 08/15/… AQS     61050002     1  245. ug/m…               339 Weaverville-Cour…
4 08/14/… AQS     61050002     1  246. ug/m…               342 Weaverville-Cour…
5 09/16/… AQS     60611004     3  296. ug/m…               442 Tahoe City-Fairw…
6 07/31/… AQS     60932001     3  302. ug/m…               454 Yreka            
# ℹ 16 more variables: `Daily Obs Count` <int>, `Percent Complete` <dbl>,
#   `AQS Parameter Code` <int>, `AQS Parameter Description` <chr>,
#   `Method Code` <int>, `Method Description` <chr>, `CBSA Code` <int>,
#   `CBSA Name` <chr>, `State FIPS Code` <int>, State <chr>,
#   `County FIPS Code` <int>, County <chr>, lat <dbl>, lon <dbl>, year <fct>,
#   month <chr>
highvalues %>%
   ggplot(mapping = aes(x = month, y = PM, fill = factor(year))) +
   geom_bar(position = "dodge", stat = "identity")

The histogram and box plot of daily averages show that the majority of the values lie between 0 and 25. As mentioned earlier, there are 215 values below 0 that should not be there, with the highest value being in the month of November. Similarly, there are many outliers that are cause for concern but are not necessarily impossible values. There are 144 values above 75 ug/m3 LC. The majority of those were recorded in 2002, which has a high of 104.3, but the highest values in the data set are all from 2022, from June to November.
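The step also asks for proportions, which the bar charts do not show directly. One way to get them, sketched here on made-up rows, is to flag implausible readings and average the flag within each year and month. The `toy` data frame and the `negative` column are hypothetical and are not taken from the EPA data.

```r
# Toy rows for illustration only.
toy <- data.frame(
  year  = c("2002", "2002", "2022", "2022", "2022", "2022", "2022", "2022"),
  month = c("01", "09", "09", "09", "09", "11", "11", "11"),
  PM    = c(25.1, 8.0, -6.7, -4.7, 10.2, -0.1, 3.4, 6.0)
)

# Flag readings below zero; the mean of a logical vector is a proportion.
toy$negative <- toy$PM < 0
prop_neg <- aggregate(negative ~ year + month, data = toy, FUN = mean)

print(prop_neg)
```

The same pattern with a flag such as `PM > 75` would give the share of very high readings per month.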

5. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.

-   state
library(ggplot2)
mergedData|>
  ggplot(mapping = aes(x = State, y = PM, fill = year)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Accent")

mergedData|>
  ggplot(mapping = aes(x = State, y = PM, fill = year)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Accent")

mergedData|>
  ggplot(mapping = aes(x = State, y=PM, color = year)) +
  geom_point(position = "jitter", alpha = 0.5) +
  scale_color_brewer(palette = "Accent") +
  geom_smooth(method = lm, se = FALSE, col = "black") +
  theme(axis.text.x = element_text(angle = 90))
`geom_smooth()` using formula = 'y ~ x'

Looking at just the bar plot, the maximum PM2.5 level rose from 104.3 in 2002 to 302.5 in 2022. However, the other graphs show that the median was much higher in 2002 than in 2022, with most 2022 values sitting at roughly half the level of the 2002 values. At the same time, 2022 has much more extreme outliers than 2002, exceeding the 2002 maximum by almost 200 ug/m3 LC.

-   county
library(ggplot2)
library(RColorBrewer)
mergedData |>
  ggplot(mapping = aes(x = County, y = PM, fill = year)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Pastel2")+
  theme(axis.text.x = element_text(angle = 90))

mergedData|>
  ggplot(mapping = aes(x = County, y=PM, color = year)) +
  geom_point(position = "jitter", alpha = 0.5) +
  scale_color_brewer(palette = "Pastel2") +
  theme(axis.text.x = element_text(angle = 90))+
  facet_wrap(~ year, nrow = 2)

which(highvalues$PM> 295 )
[1] 143 144

Looking at the counties, we can see that the 2002 data were much more evenly distributed than the 2022 data. The highest value for 2002 was found in Kern County. In 2022, values are consistently lower than in 2002 in the majority of counties, but the spread of the outliers is much greater, with the highest value in Siskiyou County, closely followed by Placer County.

-   site in Los Angeles

lacounty <- mergedData %>%
  filter(County == "Los Angeles")
lacounty |>
  ggplot(mapping = aes(x = year, y = PM, fill = year)) +
  geom_bar(position = "dodge", stat = "identity")+
  scale_fill_brewer(palette = "Pastel1")

lacounty|>
  ggplot(mapping = aes(x = year, y = PM, fill = year)) +
  geom_boxplot() +
  scale_fill_brewer(palette =  "Pastel1")

lacounty|>
  ggplot(mapping = aes(x = year, y=PM, color = year)) +
  geom_point(position = "jitter", alpha = 0.5) +
  scale_color_brewer(palette = "Accent") +
  theme(axis.text.x = element_text(angle = 90))

summary(lacounty$PM)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -1.20    7.90   11.40   13.32   16.00   72.40 
summary(lacounty[lacounty$year == 2002, "PM"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.60   11.10   17.40   19.66   25.50   72.40 
summary(lacounty[lacounty$year == 2022, "PM"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -1.20    7.40   10.30   10.97   13.70   56.00 

The data for LA County show that PM2.5 concentrations were highest in 2002, with a median of 17.4 ug/m3 LC and a maximum of 72.4 ug/m3 LC. Values were noticeably lower in 2022, with a median of 10.3 ug/m3 LC and a maximum of 56.0 ug/m3 LC. The 2022 concentrations are mostly below 20 ug/m3 LC (third quartile 13.7), with few outliers, while the 2002 concentrations are spread over a wider range of values.
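The analysis above pools every monitor in Los Angeles County, while the step asks for a single site. A possible way to narrow it down is sketched below: keep one site and compare its yearly medians. The `toy` data frame, the `site_id` column, and the site numbers are all invented for illustration.

```r
# Toy rows: two hypothetical sites, each observed in both years.
toy <- data.frame(
  site_id = c(1, 1, 1, 2, 1, 1, 1, 2),
  year    = factor(c(2002, 2002, 2002, 2002, 2022, 2022, 2022, 2022)),
  PM      = c(18, 26, 11, 30, 10, 7, 14, 9)
)

# Keep a single site so the comparison is not driven by changes in the network.
one_site <- toy[toy$site_id == 1, ]

# Median PM2.5 per year at that site, and the change between years.
medians <- tapply(one_site$PM, one_site$year, median)
change  <- medians[["2022"]] - medians[["2002"]]

print(medians)
print(change)
```

Restricting to a site that reports in both years avoids mixing a real change in concentration with the growth in the number of monitors between 2002 and 2022.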